Sequencing and Raw Sequence Data Quality Control    ◾    43

Trimmomatic is available at “http://www.usadellab.org/cms/index.php?page=trimmomatic”. You can download it from the website and unzip it using the following script:

$ wget http://www.usadellab.org/cms/uploads/supplementary/Trimmomatic/Trimmomatic-0.39.zip
$ unzip Trimmomatic-0.39.zip

Note that the version may change in the future. The unzipped directory is “Trimmomatic-0.39”, which contains two files (“LICENSE” and “trimmomatic-0.39.jar”) and a directory (“adapters”). The file “trimmomatic-0.39.jar” is the Java executable that performs the preprocessing tasks, and the directory “adapters” contains the known adaptor sequences in FASTA files. The following script uses Trimmomatic to preprocess the paired-end FASTQ files, then runs FastQC to generate QC reports, and finally displays the reports in the Firefox browser:

java -jar ../Trimmomatic-0.39/trimmomatic-0.39.jar \
    PE SRR957824_1.fastq SRR957824_2.fastq \
    out_PE_SRR957824_1.fastq out_UPE_SRR957824_1.fastq \
    out_PE_SRR957824_2.fastq out_UPE_SRR957824_2.fastq \
    ILLUMINACLIP:TruSeq3-PE.fa:2:30:10:2:True \
    LEADING:3 \
    TRAILING:3 \
    ILLUMINACLIP:TruSeq2-PE.fa:2:30:10 \
    SLIDINGWINDOW:5:30 \
    MINLEN:35
fastqc out_PE_SRR957824_1.fastq out_PE_SRR957824_2.fastq
firefox out_PE_SRR957824_1_fastqc.html out_PE_SRR957824_2_fastqc.html
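
The script above gives ILLUMINACLIP bare adaptor file names. ILLUMINACLIP takes a path to the adaptor FASTA file, so the script presumably expects those files to be reachable from the working directory. The following is a minimal sketch of building the step with an explicit path instead. The variables “TRIM_DIR” and “ADAPTERS” and the helper “clip_step” are hypothetical names introduced here, and the directory layout is assumed from the script above.

```shell
# Assumed layout: Trimmomatic was unzipped one directory above the working
# directory, as in the script above. Adjust TRIM_DIR if yours differs.
TRIM_DIR=../Trimmomatic-0.39
ADAPTERS="$TRIM_DIR/adapters/TruSeq3-PE.fa"

# Build an ILLUMINACLIP step with an explicit path to the adaptor FASTA.
# The fields after the path are, in order: seed mismatches, palindrome clip
# threshold, simple clip threshold, minimum adaptor length, keep both reads.
clip_step() {
    printf 'ILLUMINACLIP:%s:2:30:10:2:True' "$1"
}

# Report whether the adaptor file exists before it is handed to Trimmomatic.
if [ -f "$ADAPTERS" ]; then
    echo "adaptor file found: $(clip_step "$ADAPTERS")"
else
    echo "adaptor file not found at $ADAPTERS" >&2
fi
```

The string printed by “clip_step” can then be used in place of the bare “ILLUMINACLIP:TruSeq3-PE.fa:...” argument.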

The option “PE” specifies paired-end mode, and the two paired-end FASTQ files “SRR957824_1.fastq” and “SRR957824_2.fastq” are provided as inputs. The four output files follow: for each input, a file of reads whose mate also survived (“out_PE_SRR957824_1.fastq”, “out_PE_SRR957824_2.fastq”) and a file of reads whose mate was discarded (“out_UPE_SRR957824_1.fastq”, “out_UPE_SRR957824_2.fastq”). The adaptor sequences to be detected and removed from the reads are stored in FASTA files in the “adapters” directory; each “ILLUMINACLIP” step names one such file, here “TruSeq3-PE.fa” and “TruSeq2-PE.fa”. “LEADING:3” and “TRAILING:3” remove bases from the leading and trailing ends of each read whose Phred quality score is below 3. “SLIDINGWINDOW:5:30” makes the program scan each read with a 5-base-wide sliding window and cut the read where the average per-base quality in the window drops below 30. Finally, “MINLEN:35” removes the reads that are shorter than 35 bases.
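
To make the sliding-window idea concrete, here is a simplified sketch that reports how many bases of a single Phred+33 quality string would be kept. It is an illustration only, not Trimmomatic's own code: it cuts at the start of the first window whose mean quality falls below the threshold, and the real tool may place the cut differently within that window. The function name “sliding_window_keep” is hypothetical.

```shell
# Simplified illustration of SLIDINGWINDOW:<size>:<threshold>.
# $1 = Phred+33 quality string, $2 = window size, $3 = required mean quality.
sliding_window_keep() {
    printf '%s\n' "$1" | awk -v w="$2" -v t="$3" '{
        # awk has no ord(); build a character -> ASCII code lookup table
        for (i = 33; i < 127; i++) ord[sprintf("%c", i)] = i
        n = length($0)
        keep = n                      # default: keep the whole read
        for (s = 1; s + w - 1 <= n; s++) {
            sum = 0
            for (j = s; j < s + w; j++) sum += ord[substr($0, j, 1)] - 33
            # first window with a mean below the threshold: cut before it
            if (sum / w < t) { keep = s - 1; break }
        }
        print keep
    }'
}

# Ten bases at Q40 ("I") followed by five at Q2 ("#")
sliding_window_keep 'IIIIIIIIII#####' 5 30   # prints 7
```

In this example the window starting at base 8 is the first with a mean below 30 (three Q40 bases and two Q2 bases average 24.8), so seven bases are kept.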

In Figures 1.36 and 1.37, notice how the quality of the two files has improved, and also notice that the total number of sequences is equal in both files. However, the read lengths vary.
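
The read counts and the spread of read lengths can also be checked from the command line. The sketch below assumes uncompressed FASTQ files with exactly four lines per record; the function names “count_reads” and “read_length_range” are hypothetical helpers, not part of Trimmomatic or FastQC.

```shell
# Count the records in a FASTQ file (four lines per record assumed).
count_reads() {
    awk 'END { print NR / 4 }' "$1"
}

# Print the shortest and longest read length in a FASTQ file.
read_length_range() {
    awk 'NR % 4 == 2 {
             n = length($0)                     # sequence is line 2 of 4
             if (min == "" || n < min) min = n
             if (n > max) max = n
         }
         END { print min, max }' "$1"
}

# Possible usage on the paired outputs named in the text:
# count_reads out_PE_SRR957824_1.fastq
# count_reads out_PE_SRR957824_2.fastq
# read_length_range out_PE_SRR957824_1.fastq
```

If the two paired output files are in sync, “count_reads” should report the same number for both.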

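If for any reason we need reads of the same length, as some aligners may require, we can set “MINLEN:” to the maximum length. Since the maximum read length is 150 bases and “MINLEN” drops reads shorter than the given length, we can use “MINLEN:150” to keep only the reads that are still 150 bases long, as follows: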